Skip to content

fix: restore per-cycle span duration for execute_event_loop_cycle#1939

Open
Di-Is wants to merge 3 commits intostrands-agents:mainfrom
Di-Is:fix/trace-span-base-exception-handling
Open

fix: restore per-cycle span duration for execute_event_loop_cycle#1939
Di-Is wants to merge 3 commits intostrands-agents:mainfrom
Di-Is:fix/trace-span-base-exception-handling

Conversation

@Di-Is
Copy link
Copy Markdown
Contributor

@Di-Is Di-Is commented Mar 20, 2026

Description

Since v1.24.0 (PR #1293), execute_event_loop_cycle spans no longer reflect per-cycle duration. When a cycle performs tool use and recurses, the parent cycle's native OTel span stays open until all recursive children complete, producing cumulative bottom-up latency instead of per-step latency in observability backends (Langfuse, Jaeger, etc.).

The root cause: event_loop_cycle() is an async generator whose body was wrapped in use_span(end_on_exit=True). Because yield keeps the context manager open across recursive cycles, all span.end() calls fire simultaneously when the generator chain unwinds. The logical metadata (gen_ai.event.end_time attribute) is set at the correct time, but backends use the native OTel endTimeUnixNano from span.end().

Three commits, each reviewable independently:

  1. fix: restore explicit span.end() to fix span end_time regression
  • Core fix. Switches to end_on_exit=False and restores explicit span.end() calls in end_event_loop_cycle_span() and end_model_invoke_span(), with end_span_with_error() on exception paths.
  1. fix: handle BaseException in trace spans to prevent span leaks on KeyboardInterrupt
  • With end_on_exit=False, spans must be explicitly closed on all paths.
  • Adds except BaseException handlers and aclose() for async generators to cover KeyboardInterrupt and asyncio.CancelledError.
  1. perf: only force flush tracer provider when ending agent spans
  • Independent optimization. _end_span() previously called force_flush() on every span end; this limits it to agent span completion only.
  • Happy to split this into a separate PR if preferred.

Resolve #1930, #1938

Related Issues

Documentation PR

N/A

Type of Change

Bug fix

Testing

How have you tested the change? Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Di-Is added 3 commits March 20, 2026 10:59
PR strands-agents#1293 wrapped event_loop_cycle() in use_span(end_on_exit=True) and removed
explicit span.end() calls. Because event_loop_cycle is an async generator,
yield keeps the context manager open across recursive cycles, causing all
execute_event_loop_cycle spans to share the same OTel end_time.

Switch to end_on_exit=False and explicitly call span.end() via _end_span()
in end_event_loop_cycle_span() and end_model_invoke_span(), restoring
end_span_with_error() in all exception paths.
…boardInterrupt

Trace spans were not properly closed when BaseException (e.g. KeyboardInterrupt,
asyncio.CancelledError) was raised. Add explicit BaseException handlers to close
spans and aclose() calls to ensure async generators are cleaned up.
Reduce overhead by limiting force_flush calls to agent span completion
instead of every span end. Add flush parameter to _end_span() with
default False, passing True only from end_agent_span().
@Di-Is Di-Is temporarily deployed to manual-approval March 20, 2026 02:46 — with GitHub Actions Inactive
@Di-Is Di-Is changed the title Fix/trace span base exception handling Fix: trace span base exception handling Mar 20, 2026
@Di-Is Di-Is changed the title Fix: trace span base exception handling fix: restore per-cycle span duration for execute_event_loop_cycle Mar 20, 2026
@zastrowm
Copy link
Copy Markdown
Member

@Di-Is thanks for this; going to take a look.

Would it be possible to have a PR isolated to the span issue, and then spin the rest to a separate PR? It would make it easier to digest

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 95.77465% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/strands/event_loop/event_loop.py 94.64% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@lizradway lizradway self-requested a review March 31, 2026 14:20
flush: Force the tracer provider to flush after ending the span
"""
if not span:
if not span or not span.is_recording():
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The check not span.is_recording() is a good defensive addition, but the docstring doesn't mention this guard condition.

Suggestion: Update the docstring to note that the method is a no-op if the span is not recording:

"""Generic helper method to end a span.

This method is a no-op if span is None or not recording.

Args:
    ...

@github-actions
Copy link
Copy Markdown

Re: Splitting the PR

I think this PR is appropriately scoped given that commits 1 and 2 are closely related:

  • Commit 1 switches to explicit span ending (end_on_exit=False)
  • Commit 2 handles the necessary consequence: without auto-exit, spans must be closed on all paths including BaseException

These are two halves of the same fix - separating them would leave the codebase in an incomplete state between merges.

Commit 3 (force_flush optimization) is indeed independent and could be split out if preferred, though it's a small, low-risk change.

The author has structured commits well for independent review - each is reviewable on its own with clear purpose.

@github-actions
Copy link
Copy Markdown

Review Summary

Assessment: Comment (with minor suggestions)

This PR correctly addresses a real regression in span timing for observability. The fix is well-reasoned: async generator yield semantics require explicit span ending rather than relying on context manager exit.

Review Details

Strengths

  • Root cause analysis in PR description is excellent - clearly explains the async generator timing issue
  • Test coverage is comprehensive with cases for aclose(), task cancellation, and KeyboardInterrupt
  • Commit organization makes review straightforward
  • Integration test validates actual span timing behavior

Suggestions (Non-blocking)

  • Code simplification: The except Exception + except BaseException pattern can be consolidated since both handlers do the same thing (see inline comment)
  • Documentation: Minor docstring update for _end_span to note the is_recording() guard

Questions for Author

The force_flush optimization in commit 3 - have you verified this doesn't impact span export reliability for short-lived agent invocations? The change seems reasonable but curious about your testing.

Overall this is a solid bug fix. Thank you for the detailed analysis and test coverage! 🎉

@lizradway
Copy link
Copy Markdown
Member

@Di-Is thanks for this; going to take a look.

Would it be possible to have a PR isolated to the span issue, and then spin the rest to a separate PR? It would make it easier to digest

+1

@github-actions
Copy link
Copy Markdown

Correction to my earlier review comment about exception handlers:

After re-analyzing the code more carefully, I realize my suggestion to consolidate all except Exception + except BaseException handlers was too broad.

Looking at the three locations:

  1. Lines 166-171: ✅ Could be simplified - both handlers do the same thing (end span + re-raise)

  2. Lines 233-250: ❌ Cannot be simplified - except Exception wraps errors in EventLoopException and yields ForceStopEvent, while except BaseException should NOT do this for KeyboardInterrupt/CancelledError

  3. Lines 395-427: ❌ Cannot be simplified - except Exception handles retry logic and invokes hooks, while except BaseException should skip retry logic for interrupts

The design is intentional and correct - BaseException subclasses like KeyboardInterrupt and asyncio.CancelledError should propagate immediately without being wrapped in EventLoopException or triggering retry logic.

Only the first location (lines 166-171) could be simplified if desired, but it's a minor point.

@github-actions
Copy link
Copy Markdown

Updated Review Summary

Assessment: Approve ✅

After re-reviewing with the latest changes and analyzing the exception handling patterns more carefully, this PR is well-designed:

Key Points

Exception Handling Design

The separate except Exception / except BaseException handlers are intentional in most places:

  • Exception handlers include retry logic, logging, and wrapping in EventLoopException
  • BaseException handlers ensure KeyboardInterrupt and asyncio.CancelledError propagate cleanly without retry attempts or exception wrapping

Test Coverage

The tests comprehensively cover:

  • Stream aclose() scenarios
  • Task cancellation
  • KeyboardInterrupt handling
  • Proper span timing across recursive cycles

Minor Note

Only lines 166-171 could potentially be simplified (both handlers do the same thing), but this is a very minor point.

The PR correctly fixes the span duration regression while handling all edge cases properly. Ready to merge! 🚀

@zastrowm
Copy link
Copy Markdown
Member

@github-actions github-actions bot added size/l and removed size/l labels Mar 31, 2026
)

self._end_span(span, attributes, error)
self._end_span(span, attributes, error, flush=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _end_span is a shared helper used by all span types (model, tool, cycle, agent). The flush param only applies to the agent span, so it feels out of place threading it through this parameter. Could just call force_flush() directly in end_agent_span after the _end_span call and remove the span-level flush in _end_span.

ie.

def end_agent_span(self, span, response=None, error=None):
    ...
    self._end_span(span, attributes, error)
    if self.tracer_provider and hasattr(self.tracer_provider, "force_flush"):
        try:
            self.tracer_provider.force_flush()
        except Exception as e:
            logger.warning("error=<%s> | failed to force flush", e)

I think this threading makes sense for introducing minimal changes/disruptions in the codebase, but it does give me some pause on clarity/maintainability.

@lizradway
Copy link
Copy Markdown
Member

Looks like test_event_loop_cycle_closes_spans_on_stream_aclose is failing, seems the model span isn’t being closed when the stream is closed. Only the cycle span gets cleaned up.

@zastrowm
Copy link
Copy Markdown
Member

zastrowm commented Apr 1, 2026

Looks like test_event_loop_cycle_closes_spans_on_stream_aclose is failing, seems the model span isn’t being closed when the stream is closed. Only the cycle span gets cleaned up.

I'm going to force push this to remove my commits and split out to a separate PR containing only the span fix

@zastrowm zastrowm force-pushed the fix/trace-span-base-exception-handling branch from ab54df7 to 1b7b6f2 Compare April 1, 2026 14:46
@github-actions github-actions bot added size/l and removed size/l labels Apr 1, 2026
@github-actions github-actions bot added size/l and removed size/l labels Apr 1, 2026
@zastrowm zastrowm force-pushed the fix/trace-span-base-exception-handling branch from 29eaae5 to 1b7b6f2 Compare April 1, 2026 15:13
@github-actions github-actions bot added size/l and removed size/l labels Apr 1, 2026
@zastrowm
Copy link
Copy Markdown
Member

zastrowm commented Apr 1, 2026

Hey @Di-Is , thanks for the fixes; I cherry-picked the duration fix to f​ix: restore explicit span.end() to fix span end_time regression by zastrowm · Pull Request #2032 · strands-agents/sdk-python and got it merged in. Sorry for all the chatter on this PR; eventually to get the fix in sooner than later we had to split it out, so I reverted my changes to this PR.

If you're willing, the other two commits would be good as standalone PRs; @ me regardding them if you spin them out

@poshinchen
Copy link
Copy Markdown
Contributor

@Di-Is I think we can close this one? I'm reviewing #2068

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Regression: event_loop_cycle span duration becomes cumulative across recursive cycles

4 participants